library("install.load")
install_load("tidyverse", "ggplot2", "readr", "plotly", "olsrr")

taxlotdata <- read_csv("~/Desktop/RFiles/689 Assigments/taxlot_sample.csv")
taxlot_condo <- taxlotdata %>% 
  filter(condo == 1)

Question 1

depedent var = totalval

independent var = bldgsqft

g <- taxlot_condo %>% 
  ggplot(aes(BLDGSQFT, TOTALVAL, text = FID)) +
  geom_point()
ggplotly(g, tootltip = c("FID"))

FID of outliers: 2613, 2647, 1110, 2128

All four of these outliers have massive building footprints, over 40,000ft2, much higher than any of the other buildings in the dataset. They don’t seem to be part of the same data and might skew the results if included.

newdata <- taxlot_condo %>% 
  filter(BLDGSQFT <= 40000, 
         TOTALVAL>0)
  ggplot(newdata, aes(BLDGSQFT, TOTALVAL)) +
  geom_point()

Question 2

log-lin - this seems to be a good fit

log-log - this looks like a better fit

lin-log - this looks like a quadratic fit

To run an OLS test we need a linear relationship between the variables. I can’t really tell which is a better fit between the log-lin and the log-log by this visual test, but it’s obvious that the totalval variable needs to be transformed using log. I’m unsure of transforming building square footage using log would result in a better model. Further testing is needed to determine if building square footage should be transformed as well.

newdata %>% 
  ggplot(aes(log(TOTALVAL), BLDGSQFT)) +
  geom_point()

newdata %>% 
  ggplot(aes(log(TOTALVAL), log(BLDGSQFT))) +
  geom_point()

newdata %>% 
  ggplot(aes(TOTALVAL, log(BLDGSQFT))) +
  geom_point()

Question 3

  1. Multiple R-squared: 0.5369, Adjusted R-squared: 0.5364

  2. Multiple R-squared: 0.3927, Adjusted R-squared: 0.3921

  3. Multiple R-squared: 0.4095, Adjusted R-squared: 0.4089

  4. Multiple R-squared: 0.4108, Adjusted R-squared: 0.4102

To determine the best model we look for the highest Rsquared value. In these models, the model a has the highest R-squared value, meaning that model is the best fit. This is a bit confusing because the previous testing showed a good fit by transforming the total value variable using a log function. I think that model could be a good fit, but some outliers are pulling the linear model in both models. I’m not sure which is the best model so far - more testing needs to be completed!

lma <- lm(TOTALVAL ~ BLDGSQFT, newdata) %>% 
  summary()
lmb <- lm(log(TOTALVAL) ~ BLDGSQFT, newdata) %>% 
  summary()
lmc <- lm(log(TOTALVAL) ~ log(BLDGSQFT), newdata) %>% 
  summary()
lmd <- lm(TOTALVAL ~ log(BLDGSQFT), newdata) %>% 
  summary()

Question 4

  1. Residuals vs Fitted plot shows the data are not linearly related. The residual errors increase dramtically as X increases

  2. The plot shows data errors are normally distributed. The line is about a 45 degree angle with some departure at the beginning and end, which is expected.

  3. These data are not heteroskedastic according to the scale-location scatter plot because the fit line is not horizontal and trends signifigantly up as X increases.

  4. Some data seems to be exerting a large influence on the data because some points fall outside the Cook’s .5 line. This is in line with previous tests which show a large group of data points clustered with some pulling the line signifigantly steeper. This test is helping us determine which points are throwing off the fit line.

lm_resfit <- lm(TOTALVAL ~ BLDGSQFT, newdata)
plot(lm_resfit, ask=FALSE)

Question 5

  1. not a good fit

  2. also not a good fit

  3. better…?

  4. not the best

Based on these scatterplots I think c, or log-log, is the best fit, but it’s really hard to tell.

newdata %>% 
  ggplot(aes(TOTALVAL, dpioneer)) +
  geom_point()

newdata %>% 
  ggplot(aes(log(TOTALVAL), dpioneer)) +
  geom_point()

newdata %>% 
  ggplot(aes(log(TOTALVAL), log(dpioneer))) +
  geom_point()

newdata %>% 
  ggplot(aes(TOTALVAL, log(dpioneer))) +
  geom_point()

Question 6

  1. Multiple R-squared: 0.1276, Adjusted R-squared: 0.1267

  2. Multiple R-squared: 0.2867, Adjusted R-squared: 0.286

  3. Multiple R-squared: 0.25, Adjusted R-squared: 0.2492

  4. Multiple R-squared: 0.1263, Adjusted R-squared: 0.1254

Based on the R-squared stats model b, or log-lin, is the best model.

lma1 <- lm(TOTALVAL ~ dpioneer, newdata) %>% 
  summary()
lmb1 <- lm(log(TOTALVAL) ~ dpioneer, newdata) %>% 
  summary()
lmc1 <- lm(log(TOTALVAL) ~ log(dpioneer), newdata) %>% 
  summary()
lmd1 <- lm(TOTALVAL ~ log(dpioneer), newdata) %>% 
  summary()

Question 7

  1. not a good fit - one or more of the variables needs to be transformed using a log

  2. this could be an okay fit.

  3. better…? data are grouped in a cloud in the center of the plot

  4. fairly linear, but difficult to tell.

Based on these scatterplots I think c, or log-log, is the best fit, but it’s really hard to tell.

newdata %>% 
  ggplot(aes(TOTALVAL, dmax)) +
  geom_point()

newdata %>% 
  ggplot(aes(log(TOTALVAL), dmax)) +
  geom_point()

newdata %>% 
  ggplot(aes(log(TOTALVAL), log(dmax))) +
  geom_point()

newdata %>% 
  ggplot(aes(TOTALVAL, log(dmax))) +
  geom_point()

Question 8

  1. Multiple R-squared: 0.05245, Adjusted R-squared: 0.05148

  2. Multiple R-squared: 0.09992, Adjusted R-squared: 0.099

  3. Multiple R-squared: 0.07203, Adjusted R-squared: 0.07109

  4. Multiple R-squared: 0.03544, Adjusted R-squared: 0.03446

  5. Multiple R-squared: 0.05565, Adjusted R-squared: 0.05372

Based on the R-squared stats model b, or log-lin, is the best model.

lma2 <- lm(TOTALVAL ~ dmax, newdata) %>% 
  summary()
lmb2 <- lm(log(TOTALVAL) ~ dmax, newdata) %>% 
  summary()
lmc2 <- lm(log(TOTALVAL) ~ log(dmax), newdata) %>% 
  summary()
lmd2 <- lm(TOTALVAL ~ log(dmax), newdata) %>% 
  summary()
lme2 <- lm(TOTALVAL ~ dmax+I(dmax^2), newdata) %>% 
  summary()

Question 9

After testing the pairwise models it’s unclear if the best model is a linear representaiton of TOTALVAL or a log of TOTALVAL, so regression test is useful. After running the regression test it’s obvious log(TOTALVAL) is a better fit. TOTALVAL : Multiple R-squared: 0.7036 log(TOTALVAL): Multiple R-squared: 0.7284 72% > 70%, so log(TOTALVAL) is a better model

  1. all VIF are below 2 so multicollinarity doesn’t seem to be present.

  2. the line is roughly horizontal, meaning the means of the residuals are fairly linear

  3. according to the Q-Q plot, the data are fairly normal. The points bascially fall along a 45 degree line with some departure at the ends of the data.

  4. *************** I don’t know how to interepret the breusch pagan test

lm(TOTALVAL ~ BLDGSQFT + dpioneer + dmax, newdata) %>% 
summary
## 
## Call:
## lm(formula = TOTALVAL ~ BLDGSQFT + dpioneer + dmax, data = newdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1116139   -62708   -10343    42980  1629457 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  69922.75   11546.19   6.056 1.99e-09 ***
## BLDGSQFT       340.62       7.82  43.558  < 2e-16 ***
## dpioneer    -47182.66    2520.30 -18.721  < 2e-16 ***
## dmax         -6382.66    7065.21  -0.903    0.367    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 154200 on 978 degrees of freedom
## Multiple R-squared:  0.7036, Adjusted R-squared:  0.7027 
## F-statistic: 773.7 on 3 and 978 DF,  p-value: < 2.2e-16
lm(log(TOTALVAL) ~ BLDGSQFT + dpioneer + dmax, newdata) %>% 
summary
## 
## Call:
## lm(formula = log(TOTALVAL) ~ BLDGSQFT + dpioneer + dmax, data = newdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7015 -0.1706  0.0204  0.1964  0.8372 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.214e+01  2.461e-02 493.131   <2e-16 ***
## BLDGSQFT     6.648e-04  1.667e-05  39.880   <2e-16 ***
## dpioneer    -1.534e-01  5.373e-03 -28.559   <2e-16 ***
## dmax         3.764e-04  1.506e-02   0.025     0.98    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3288 on 978 degrees of freedom
## Multiple R-squared:  0.7284, Adjusted R-squared:  0.7276 
## F-statistic: 874.5 on 3 and 978 DF,  p-value: < 2.2e-16
lm_TFPD <- lm(log(TOTALVAL) ~ BLDGSQFT + dpioneer + dmax, newdata)

ols_vif_tol(lm_TFPD)
## # A tibble: 3 x 3
##   Variables Tolerance   VIF
##   <chr>         <dbl> <dbl>
## 1 BLDGSQFT      0.995  1.01
## 2 dpioneer      0.671  1.49
## 3 dmax          0.673  1.48
plot(lm_TFPD, ask=FALSE)

ols_test_normality(lm_TFPD)
## -----------------------------------------------
##        Test             Statistic       pvalue  
## -----------------------------------------------
## Shapiro-Wilk              0.8904         0.0000 
## Kolmogorov-Smirnov        0.0686          2e-04 
## Cramer-von Mises         173.7969        0.0270 
## Anderson-Darling          8.0214         0.0000 
## -----------------------------------------------
ols_test_breusch_pagan(lm_TFPD)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##                   Data                    
##  -----------------------------------------
##  Response : log(TOTALVAL) 
##  Variables: fitted values of log(TOTALVAL) 
## 
##           Test Summary           
##  --------------------------------
##  DF            =    1 
##  Chi2          =    994.1349 
##  Prob > Chi2   =    3.381657e-218

Question 9.5

vcov0 <- vcov(lm_TFPD)
require(car)
vcov_hc3 <- hccm(lm_TFPD, type = "hc3")
if(!require(lmtest))
  install.packages("lmtest") & library(lmtest)
coeftest(lm_TFPD, vcov_hc3)
## 
## t test of coefficients:
## 
##                Estimate  Std. Error  t value  Pr(>|t|)    
## (Intercept) 12.13818343  0.10849546 111.8773 < 2.2e-16 ***
## BLDGSQFT     0.00066482  0.00010277   6.4688 1.558e-10 ***
## dpioneer    -0.15344513  0.00657837 -23.3257 < 2.2e-16 ***
## dmax         0.00037642  0.01361227   0.0277    0.9779    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Question 10

  1. discuss the t-test and significance of each coefficent

  2. explain how each coefficient relations to totalval

  3. goodness of fit (r2)

  4. limitations of the model